Resilience to Device Driver Failures Using Virtualization
Abstract
Faulty device drivers are a significant cause of system failures. Low-overhead mechanisms that leverage virtualization can detect and recover from device driver failures without requiring modifications to the device driver, applications, or OS running in the VMs. These mechanisms can vary significantly in terms of coverage, recovery latency, and implementation complexity. This paper explores the design space of such mechanisms, provides a taxonomy for their characterization, and evaluates key points in the design space. Based on full implementations of a variety of mechanisms, design tradeoffs are described and key implementation challenges are identified. Schemes are evaluated on a variety of system configurations with multiple devices and multiple VMs running applications. Extensive fault injection campaigns are used to evaluate the effectiveness of the different mechanisms. It is shown that simple recovery schemes, transparent to the VMs running applications, can effectively recover from a very high percentage of faults. However, in order to minimize service interruption duration, it is necessary to use schemes that are slightly more complex, involving redundant device controllers.
Index Terms: Fault tolerance, recovery, virtual machine, VMM, hypervisor, network, storage
Related resources
Quest-V: A Virtualized Multikernel for High-Confidence Systems
This paper outlines the design of ‘Quest-V’, which is implemented as a collection of separate kernels operating together as a distributed system on a chip. Quest-V uses virtualization techniques to isolate kernels and prevent local faults from affecting remote kernels. This leads to a high-confidence multikernel approach, where failures of system subcomponents do not render the entire system in...
Evaluating Multipath TCP Resilience against Link Failures
Standard TCP is the de facto reliable transfer protocol for the Internet. It is designed to establish a reliable connection using only a single network interface. However, standard TCP with single interfacing performs poorly due to intermittent node connectivity. This requires the re-establishment of connections as the IP addresses change. Multi-path TCP (MPTCP) has emerged to utilize multiple ...
Improving Device Driver Reliability through Decoupled Dynamic Binary Analyses
Device drivers are Operating Systems (OS) extensions that enable the use of I/O devices in computing systems. However, studies have identified drivers as an Achilles’ heel of system reliability, their high fault rate accounting for a significant portion of system failures. Consequently, significant effort has been directed towards improving system robustness by protecting system components (e.g...
A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment
Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multipetaflops and beyond to exascale platforms, the occurrence of failure will be much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to i...
SR-IOV Networking in Xen: Architecture, Design and Implementation
SR-IOV capable network devices offer the benefits of direct I/O throughput and reduced CPU utilization while greatly increasing the scalability and sharing capabilities of the device. SR-IOV allows the benefits of the paravirtualized driver’s throughput increase and additional CPU usage reductions in HVMs (Hardware Virtual Machines). SR-IOV uses direct I/O assignment of a network device to mult...
Publication date: 2013